What factors affect the medical charges?
The dataset is from Kaggle and contains the following variables:
| Variable | Description |
|---|---|
| age | age of primary beneficiary |
| sex | Gender of the insurance contractor: female or male |
| bmi | Body mass index, an objective index of body weight relative to height (kg/m^2); the ideal range is 18.5 to 24.9 |
| children | Number of children covered by health insurance (number of dependents) |
| smoker | Smoking status: yes or no |
| region | The beneficiary’s residential area in the US: northeast, southeast, southwest, or northwest |
| charges | Individual medical costs billed by health insurance |
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
Exploring the variables
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 27.00 39.00 39.21 51.00 64.00
From the graph, we can see that the charges increase as age goes up.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.095 2.000 5.000
This plot shows that families with 4 or 5 children have lower charges than families with fewer children, which is unexpected.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.96 26.30 30.40 30.66 34.69 53.13
From this plot, we can see that charges increase as bmi increases.
##
## female male
## 662 676
As we might guess, there is not much difference between the insurance charges of males and females.
##
## no yes
## 1064 274
We can see that smokers have higher charges than non-smokers, which suggests that smoking may have a negative impact on smokers’ health.
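The summaries in this section can be reproduced with a few base R calls. The sketch below is only an illustration: the data frame toy and its values are invented, and only the column names follow the variable table above.

```r
# Toy data frame with the same column names as the insurance data.
# The values are invented for illustration only.
toy <- data.frame(
  age     = c(19, 33, 45, 52, 28, 61),
  bmi     = c(27.9, 22.7, 30.1, 35.2, 24.6, 29.0),
  smoker  = factor(c("yes", "no", "no", "yes", "no", "yes")),
  charges = c(16900, 4400, 8200, 42000, 3900, 30100)
)

age_summary    <- summary(toy$age)                       # quartiles, mean, min, max
smoker_counts  <- table(toy$smoker)                      # counts per category
mean_by_smoker <- tapply(toy$charges, toy$smoker, mean)  # mean charges per group

print(age_summary)
print(smoker_counts)
print(mean_by_smoker)
```

On the real data, the same calls would be applied to the insurance data frame instead of toy.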
Building a model that predicts the medical charges
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region +
## sex, data = insurance)
##
## Coefficients:
## (Intercept) age bmi children
## -11938.5 256.9 339.2 475.5
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 23848.5 -353.0 -1035.0 -960.1
## sexmale
## -131.3
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region +
## sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## sexmale -131.3 332.9 -0.394 0.693348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Coefficients:
## (Intercept) age bmi children
## -11990.3 257.0 338.7 474.6
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 23836.3 -352.2 -1034.4 -959.4
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
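To make the coefficients concrete, here is a small worked prediction using the estimates from the summary above (the model without sex). The beneficiary is hypothetical: age 40, BMI 30, one child, living in the northeast, which is the baseline region and therefore adds no region term.

```r
# Coefficients copied from the summary above (model without sex).
b <- c(intercept = -11990.27, age = 256.97, bmi = 338.66,
       children = 474.57, smokeryes = 23836.30)

# Hypothetical beneficiary: age 40, BMI 30, one child, northeast region
# (the baseline region, so no region coefficient is added).
base <- b[["intercept"]] + b[["age"]] * 40 + b[["bmi"]] * 30 + b[["children"]] * 1

pred_nonsmoker <- base
pred_smoker    <- base + b[["smokeryes"]]

print(round(c(nonsmoker = pred_nonsmoker, smoker = pred_smoker), 2))
```

Under this model the two predictions differ by exactly the smokeryes coefficient, whatever the other inputs are.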
Checking the model
The Residuals vs Fitted plot raises a concern that the relationship is non-linear.
The Normal Q-Q plot shows that the residuals are not normally distributed.
The Scale-Location plot suggests that the assumption of equal variance is satisfied, since the points are randomly distributed apart from those in the lower left.
From the Residuals vs Leverage plot, we can see that observation 1048 could be a potentially influential observation.
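The four diagnostic plots discussed above are what plot() produces for an lm fit, and influential points can also be checked numerically. The following is a minimal sketch on synthetic data (not the insurance data) in which one extreme point is planted on purpose.

```r
set.seed(1)
# Synthetic data (not the insurance data): a linear trend plus one extreme point.
x <- c(runif(30, 0, 10), 30)
y <- c(2 + 3 * x[1:30] + rnorm(30), 20)
fit <- lm(y ~ x)

# The four standard diagnostic plots. They are drawn to a null device here so
# the sketch runs without a display; drop pdf(NULL) and dev.off() to see them.
pdf(NULL)
par(mfrow = c(2, 2))
plot(fit)
invisible(dev.off())

cooks <- cooks.distance(fit)   # influence of each observation on the fit
lev   <- hatvalues(fit)        # leverage of each observation

most_influential <- which.max(cooks)
print(most_influential)
```

The planted point (row 31) combines high leverage with a large residual, which is the pattern the Residuals vs Leverage plot is meant to reveal.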
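Since there is a concern that the relationship is non-linear, I try replacing age with its square, because the plot of charges against age does not look linear. Note that inside an R formula, age^2 is read as a formula operator and reduces to age, which is why the fit below is identical to the previous one; squaring the variable itself requires I(age^2).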
##
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region,
## data = insurance)
##
## Coefficients:
## (Intercept) age bmi children
## -11990.3 257.0 338.7 474.6
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 23836.3 -352.2 -1034.4 -959.4
##
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
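The output above is identical to the earlier fit with plain age, and the coefficient is still labelled age. This follows from how R reads formulas: ^ is a crossing operator there, so age^2 reduces to age, while I(age^2) squares the variable. The sketch below demonstrates this on synthetic data (not the insurance data).

```r
set.seed(42)
# Synthetic data (not the insurance data) with a genuinely quadratic trend.
age <- sample(18:64, 200, replace = TRUE)
y   <- 1000 + 3 * age^2 + rnorm(200, sd = 500)
d   <- data.frame(age = age, y = y)

fit_linear <- lm(y ~ age, data = d)
fit_caret  <- lm(y ~ age^2, data = d)      # formula operator: same model as y ~ age
fit_square <- lm(y ~ I(age^2), data = d)   # arithmetic square of age

print(names(coef(fit_caret)))
print(names(coef(fit_square)))
print(all.equal(coef(fit_linear), coef(fit_caret)))
```

Refitting the insurance model with I(age^2) would give different coefficients and diagnostics from those shown here; they are not reproduced because they are not part of the original output.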
Now the Residuals vs Fitted plot suggests that the relationship is linear.
The Normal Q-Q plot shows that the residuals are not normally distributed.
The Scale-Location plot shows two clusters, but the assumption of equal variance appears to be satisfied, since the points are randomly distributed around the line.
From the Residuals vs Leverage plot, we can see that many observations could be potentially influential.
To improve the model, we will check if there is an interaction between the explanatory variables.
##
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19768.0 -4400.7 -869.5 2957.7 31055.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5879.42 976.87 6.019 2.27e-09 ***
## bmi 83.35 31.27 2.666 0.00778 **
## smokeryes -19066.00 2092.03 -9.114 < 2e-16 ***
## bmi:smokeryes 1389.76 66.78 20.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7412
## F-statistic: 1277 on 3 and 1334 DF, p-value: < 2.2e-16
The interaction term bmi:smokeryes is statistically significant (p < 2e-16). The plot also shows an interaction between bmi and smoker status: charges rise much more steeply with bmi for smokers than for non-smokers.
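The size of the interaction can be read directly from the coefficients in the summary above. The sketch below only does arithmetic on those four estimates; the function gap_at is a helper introduced here for illustration.

```r
# Coefficients copied from the bmi * smoker summary above.
b0       <- 5879.42    # intercept (non-smokers)
b_bmi    <- 83.35      # BMI slope for non-smokers
b_smoker <- -19066.00  # intercept shift for smokers
b_inter  <- 1389.76    # extra BMI slope for smokers

slope_nonsmoker <- b_bmi
slope_smoker    <- b_bmi + b_inter

# BMI at which the two fitted lines cross (smoker gap equals zero).
crossover_bmi <- -b_smoker / b_inter

# Fitted smoker-minus-non-smoker gap in charges at a given BMI.
gap_at <- function(bmi) b_smoker + b_inter * bmi

print(c(slope_nonsmoker = slope_nonsmoker, slope_smoker = slope_smoker))
print(round(crossover_bmi, 2))
print(round(gap_at(c(15.96, 30.40, 53.13)), 0))
```

The fitted lines cross at a BMI of about 13.7, below the smallest BMI in the data (15.96), so despite the negative smokeryes coefficient the fitted charges for smokers are higher across the whole observed range.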
Adding the interaction between bmi and smoker status to the model
##
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region +
## bmi * smoker, data = insurance)
##
## Coefficients:
## (Intercept) age bmi children
## -2453.56 264.04 22.61 512.71
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## -20309.09 -581.70 -1207.01 -1227.60
## bmi:smokeryes
## 1438.11
##
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region +
## bmi * smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14655.4 -1918.9 -1313.4 -489.7 30333.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2453.564 857.695 -2.861 0.00429 **
## age 264.042 9.522 27.729 < 2e-16 ***
## bmi 22.615 25.620 0.883 0.37756
## children 512.713 110.266 4.650 3.65e-06 ***
## smokeryes -20309.092 1648.861 -12.317 < 2e-16 ***
## regionnorthwest -581.704 381.215 -1.526 0.12727
## regionsoutheast -1207.011 383.109 -3.151 0.00167 **
## regionsouthwest -1227.601 382.576 -3.209 0.00136 **
## bmi:smokeryes 1438.108 52.630 27.325 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4851 on 1329 degrees of freedom
## Multiple R-squared: 0.8405, Adjusted R-squared: 0.8395
## F-statistic: 875.4 on 8 and 1329 DF, p-value: < 2.2e-16
Adding the interaction raises the Adjusted R-squared from 0.7496 to 0.8395.
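As a check on how the adjusted figure relates to the plain R-squared, the usual formula can be applied to the reported values. This is a sketch: adj_r2 is a helper defined here, n = 1338 is inferred from the residual degrees of freedom, and the inputs are the rounded values from the summaries, so the results agree only up to rounding.

```r
# Adjusted R-squared from R-squared, sample size n, and number of slope terms p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1338  # 1330 residual df + 7 slopes + 1 intercept

without_interaction <- adj_r2(0.7509, n, p = 7)
with_interaction    <- adj_r2(0.8405, n, p = 8)

print(round(c(without_interaction, with_interaction), 4))
```

The penalty for the extra term is small with this many observations, so nearly all of the gain in R-squared carries over to the adjusted value.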
Determine the prediction errors
Validation set approach
## [1] 1100 7
## [1] 29231798
The test error (mean squared error) for the regression model for charges on age^2, BMI, children, smoker status, BMI*smoker, and region based on the validation set approach is 29231798.
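The validation set approach can be sketched as follows. This is an illustration on synthetic data (not the insurance data); the 80/20 split and the variable names x1, x2, and y are assumptions made for the example, whereas the analysis above used a training set of 1100 rows.

```r
set.seed(123)
# Synthetic data (not the insurance data).
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 5 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

# Use 80% of the rows for training and hold out the rest for validation.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- d[train_idx, ]
valid <- d[-train_idx, ]

fit <- lm(y ~ x1 + x2, data = train)

# Validation MSE: mean squared prediction error on the held-out rows.
valid_mse <- mean((valid$y - predict(fit, newdata = valid))^2)
print(valid_mse)
```

Because the estimate depends on which rows land in the validation set, it changes from run to run unless the seed is fixed.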
K-fold cross-validation approach
## [1] 23639330
The test error for the regression model for charges on age^2, BMI, children, smoker status, BMI*smoker, and region based on the 5-fold cross-validation approach is 23639330.
We can see that the 5-fold cross-validation approach produces a lower error than the validation set approach.
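For reference, 5-fold cross-validation can be written out by hand in base R. The sketch below uses synthetic data (not the insurance data) and is not necessarily how the figure above was computed; it only shows the mechanics of splitting, refitting, and averaging.

```r
set.seed(123)
# Synthetic data (not the insurance data).
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 5 + 2 * d$x1 - 1 * d$x2 + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of equal size.
fold <- sample(rep(seq_len(k), length.out = n))

fold_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x1 + x2, data = d[fold != i, ])   # fit on the other k - 1 folds
  held <- d[fold == i, ]                            # evaluate on the held-out fold
  mean((held$y - predict(fit, newdata = held))^2)
})

cv_mse <- mean(fold_mse)  # average of the k held-out MSEs
print(cv_mse)
```

Every row is used for validation exactly once, which makes the estimate less dependent on a single random split than the validation set approach.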
LOOCV approach
## [1] "call" "K" "delta" "seed"
## [1] 23712836
The LOOCV approach produces a lower error (23712836) than the validation set approach and a slightly higher error than the 5-fold cross-validation approach.
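For a linear model fitted by least squares, the LOOCV error does not require refitting the model n times: it can be computed from a single fit using the leverages. The sketch below, on synthetic data (not the insurance data), compares that shortcut with the explicit leave-one-out loop.

```r
set.seed(123)
# Synthetic data (not the insurance data).
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

fit <- lm(y ~ x, data = d)

# Shortcut: LOOCV MSE from a single fit, scaling each residual by 1 - leverage.
loocv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit version: refit n times, each time leaving one row out.
loo_err <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  d$y[i] - predict(fit_i, newdata = d[i, , drop = FALSE])
})
loocv_explicit <- mean(loo_err^2)

print(c(shortcut = loocv_shortcut, explicit = loocv_explicit))
```

The two values agree up to floating-point error, and the shortcut is much cheaper on a dataset with over a thousand rows.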